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Abstract 

Kernel matrices are popular in machine learning and scientific computing, but they are limited by 
their quadratic complexity in both construction and storage. It is well-known that as one varies the 
kernel parameter, e.g., the width parameter in radial basis function kernels, the kernel matrix changes 
from a smooth low-rank kernel to a diagonally-dominant and then fully-diagonal kernel. Low-rank 
approximation methods have been widely-studied, mostly in the first case, to reduce the memory storage 
and the cost of computing matrix-vector products. Here, we use ideas from scientific computing to 
propose an extension of these methods to situations where the matrix is not well-approximated by a low- 
rank matrix. In particular, we construct an efficient block low-rank approximation method—which we 
call the Block Basis Factorization—and we show that it has 0(n) complexity in both time and memory. 

Our method works for a wide range of kernel parameters, extending the domain of applicability of 
low-rank approximation methods, and our empirical results demonstrate the stability (small standard 
deviation in error) and superiority over current state-of-art kernel approximation algorithms. 

Keywords, kernel matrix, low-rank approximation, data-sparse representation, machine learning, high¬ 
dimensional data 

1 Introduction 

Kernel methods play an important role in a variety of applications in machine learning and scientific com¬ 
puting. They implicitly map any set of data to a high-dimensional feature space via a kernel function. Given 
n data points x\,... ,x n € W l and a kernel function JC : M. d x — > M, the corresponding kernel matrix 

K € M nxn is defined as K l ;l = JC(xi,Xj). A limitation of these methods lies in their lack of scalability: 
0(n 2 d ) time is required for their - computation; O(jr) time is required for a matrix-vector multiplication; 
and 0(n 2 ) time is required even to write down the full kernel. This quadratic cost is prohibitive in many 
data-intensive applications. 

In machine learning, it is popular to tty to avoid this problem by considering low-rank matrix approxi¬ 
mations such as those provided by the Nystrom method [7, 10, 2, 1], Often, however, the matrices are not 
well-approximated by a low-rank matrix. In scientific computing, one way to deal with this latter situation 
is to consider structured matrix factorizations, where one represents exactly or approximately an n x n full- 
rank matrix in such a way that it can be applied to an arbitrary vector in 0(n) or 0(n log(n)) time. While 
low-rank approximations do this, so do many other classes of structured approximations. Perhaps the most 
well-known of this class of methods is the fast multipole method (FMM) [13]. 
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In this paper, we adopt some of the ideas from the FMM to obtain improved kernel approximations in 
machine learning. In particular, we propose a novel randomized procedure to compute a block low-rank 
approximation. Our method does not form the entire matrix, and—since it forms the approximation in 
a block-wise low-rank fashion—it generalizes low-rank methods that have been used recently in machine 
learning. (Informally, it efficiently captures both the “near-field” effects, i.e., interactions between data 
points that are near each other in their original representation, as well as the “far-held” kernel interactions, 
i.e., between data points that arc far from each other in their original representation.) 

Creating a single scheme that by design remains efficient for a wide range of kernel parameters is 
particularly important in light of recent results demonstrating the dependence of kernel matrix properties on 
the kernel parameter. For example, spectral properties such as spectral decay and leverage score uniformity 
(i.e., coherence) depend strongly on kernel parameters such as the radial basis function width parameter and 
the degree of sparsity, and in many cases of interest these matrices are not particularly well approximated 
by low-rank matrices [10]. This is the case for the Gaussian kernel matrices exp (—\\x — y\\\/h 2 ), which 
go from low-rank to diagonal as one changes h from very large to very small; while a traditional low-rank 
approximation will fail abruptly in the small-/; regime, our algorithm is able to better represent the diagonal 
part, and thus better approximate the entire matrix. 

In the remainder of this introduction, we will first (in Section 1.1) summarize our main results; and we 
will then (Section 1.2) describe related recent work. 

1.1 Main Contributions 

In this paper, we provide an accurate kernel matrix approximation algorithm that has a more consistent error 
(i.e., lower standard deviation) than existing methods, that efficiently captures both near-field and far-held 
interactions between data points, and that has linear, i.e., (D(n), complexity. In more detail, here are our 
main contributions. 

• We present a novel matrix format called the Block Basis Factorization (BBF) for machine learning 
applications. This is an efficient data-sparse representation of the kernel matrix with O(n) memory 
(see Section 2). This matrix format is efficient for a wide range of kernel parameters. 

• We propose an 0{n) algorithm to construct the BBF given a desired level of accuracy (see Section 3). 
Among other things, we provide a heuristic technique to compute near-optimal parameters for the 
method. 

• We provide an empirical evaluation of our method on both synthetic and real datasets (see Section 4). 
In particular, this demonstrates the superiority of our algorithm over state-of-art low-rank methods 
and recent block factorizations. 

Our algorithm first clusters the data into k distinct groups, and it permutes the matrix according to these 
clusters. Next, it computes the column basis (resp. row basis) for each row-submatrix (the interaction 
between one cluster and all data points) via a subsampling procedure and a randomized algorithm. Then, 
for every block, it uses the corresponding column and row basis to compress the block, i.e., the “inner” 
block, also using a random sub-sampling algorithm. Consequently, our method computes a block-wise low- 
rank approximation for the k 2 blocks using a set of only k bases for a symmetric kernel matrix (2k for 
a non-symmetric matrix). The resulting framework gives a rank-(rk) approximation with 0(nr + ( rk ) 2 ) 
storage. This should be contrasted with a low-rank scheme that gives a rank-r approximation using 0(nr) 
memory. For the latter case, r is typically much larger (for the same level of accuracy), making our BBF 
computationally more efficient. Note that, in particular, our algorithm takes as input the original feature 
vectors; and thus, by design, it can run without forming the full kernel matrix. 
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1.2 Related Research 


There is a large body of research in scientific computing that aims to accelerate kernel methods. The FMM 
partitions points into different boxes and efficiently approximates interactions between boxes that are far 
apart. From a linear algebraic perspective, this involves using a low-rank approximation for off-diagonal 
blocks. The FMM was originally used for solving kernel summation of potential fields, and it then was 
extended to the solution of Helmholtz [5, 6] and Maxwell's [5, 4] equations. The FMM and similar methods 
work well for dimension up to three, but the complexity grows exponentially with the dimension, and thus 
for higher dimensions they fail due to lack of scalability. The Improved Fast Gauss Transform (IFGT) was 
proposed to alleviate the dimensionality issue for the Gaussian kernel [22]. It is a faster version of the Fast 
Gauss Transform (FGT) [14]. The FGT offers an O (rn + n) algorithm for matrix-vector multiplications with 
Gaussian-kemel matrices of size rn by n. The IFGT further reduces the growth rate of the constant factor 
from exponential to asymptotically polynomial order with respect to dimensionality d (the cost is 0{d p ) for 
d —>• oo and a moderate p). However when d increases, it still requires a very large number of points n to 
start exhibiting a cost with linear growth. 

Another popular approach uses direct low-rank matrix factorizations: given a matrix K 6 M nx ”, find a 
rank-r approximation of K via two tall and skinny matrices U, V € M nxr : K ~ UV T [11]. The singular 
value decomposition (SVD) is the “gold-standard” method of low-rank factorization: it achieves the low¬ 
est reconstruction error, but has a cubic factorization cost; even computing the best rank-r approximation 
takes at least 0(n 2 r) time with traditional methods. Recent work in Randomized Numerical Linear Algebra 
(RandNLA) has focused on using randomized matrix algorithms to speed up low-rank matrix approxima¬ 
tions [17], For example, one RandNLA algorithm is the randomized SVD algorithm of Halko et al. [15] 
that computes a low-dimensional approximation of the range space of K. Relatedly, Liberty et al. [16] and 
Sarlos [19] do a random projection of the matrix onto a /^dimensional (p > r) subspace. These methods 
have an improved 0(n 2 logr) complexity, but they still take the matrix K as input and thus need to form 
the entire matrix. 

A related algorithm is the Nystrom method [7, 10], which randomly subsamples the columns of the 
matrix and uses them to construct a low-rank approximation. A naive Nystrom algorithm uses a uniform 
sampling, which is computationally inexpensive, but which works well only when the matrix has uniform 
leverage scores, i.e., low coherence. Improved versions of Nystrom have been proposed to provide more 
sophisticated ways of sampling the columns: for example, one can sample with probabilities proportional 
to the statistical leverage scores—which can be approximated with the algorithm of Drineas et al. [8] in 
essentially random projection time [10]— to obtain very strong results; or one can use the algorithm of 
Alaoui and Mahoney [1] to compute a variant of the leverage scores without even forming the entire ma¬ 
trix, thereby obtaining improved statistical bounds. The random sampling algorithm of Engquist et al. [9] 
gives an 0(r 2 (m + n)) method to choose important rows/columns alternatively using the QR factoriza¬ 
tion. K-means Nystrom [23] uses /.'-means clustering to choose the landmark points instead of sampling 
real points. A related line of research includes random feature maps: for example. Random Kitchen Sinks 
(RKS) [18] proposed a random feature map using the Fourier transform of the kernel function to map data 
to an Euclidean inner-product space. 

There have also been several approaches to try to address problems that are not well-approximated by 
low-rank matrices. In particular, the idea of combining the low-rank approximation with a preclustering of 
the data points has been recently used in machine learning. For example, the clustered Low-Rank Approx¬ 
imation (CLRA) [20] performs a block-wise low-rank approximation of the kernel matrix, but it requires 
forming the entire matrix and has only been applied to social network data; and the Memory Efficient Kernel 
Approximation (MEKA) [21] improved upon CLRA by avoiding the formation of the entire matrix. (While 
interesting, we found that, in our benchmarks, this method is not stable, and produces errors with significant 
variations, e.g., a large standard deviation over multiple trials.) 
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We briefly discuss the main differences between our algorithm and MEKA. We compute the low-rank 
basis vectors from a larger space, that is, from the entire row-submatrices instead of the diagonal blocks 
(resulting in a more accurate basis). We use a sampling procedure that is more sophisticated than uniform 
sampling, but still has linear cost. We provide a parameter selection method that computes the near-optimal 
number of clusters k, as well as the ranks for each cluster, given a desired accuracy. Therefore, for a given 
memory cost (which can be closely related to the computational cost of a fast matrix-vector product), our 
algorithm captures more information from the matrix while still keeping linear complexity. 


2 Block Basis Factorization (BBF) of Matrix 


In this section, we define the Block Basis Factorization (BBF) of a matrix. Given a matrix M € M” xn 
partitioned into k by k blocks, let M t J denote the th block for i = 1,.... k. Then the BBF of M is 
defined as: 

M = UCV T (1) 

where U is a block diagonal matrix with the i-th diagonal block II, being the column basis of ? for all 
j, V is a block diagonal matrix with the j-th diagonal block Vj being the row basis of Mij for all i, and 
C is a k by k block matrix with the th block denoted by Cjj = Uf In practice, C l: j is an 

approximation of Uj A/ XJ V) up to a desired accuracy e. 

Based on the definition of the BBF, we have that U r is the column basis of M r : and V) is the row basis 
of M : j, where 

Mi - = [Mi,i • • • M iik ] , 
and M : j = [M^ ••• M^\ T 

are the i-th row-submatrix and j-th column-submatrix, respectively. Hence, the width of U, is the numerical 
rank of for a desired accuracy e, and the width of Vj is the numerical rank of M-j. If the numerical 
ranks of all row- and column- submatrices arc bounded by r, then the memory cost for the BBF is given by 
0(nr + ( rk ) 2 ). Further, if k < y/n and r is a constant independent of n, then the BBF gives a data-sparse 
representation of M with linear memory (see Figure 1). In this case, the complexity for both storing the 
BBF and applying it to a vector will be linear. 
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Figure 1: M = U C V T 


3 Fast Algorithm for BBF 

In this section, we provide a fast algorithm, with linear complexity, to obtain a BBF. We state our main the¬ 
orem and the basic idea of our algorithm in Section 3.1. In Section 3.2 and Section 3.3 we give details about 
BBF construction and parameter selection, respectively. We consider the kernel matrix to be symmetric in 
this paper; the non-symmetric case can be derived in a similar fashion. 
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3.1 Main Theorem and Algorithm 

We consider n data points X = {ZiKLi, x i € partitioned into k clusters C = and a shift- 

invariant kernel function TC : x —> M. saassWe start by stating a theorem which gives a bound on the 

reconstruction error of the best rank-r approximation of the interaction matrix /C (C t ,: 

Theorem 3.1. Assume that the kernel function 1C is a shift-invariant kernel, i.e., /C(x, y) = f(t (x — y)) = 
g x - y (t) where g x - y (t) is Lipschitz continuous with the Lipschitz constant n. Given a k-cluster partition 
C = {C, } j _ | of the point set X = {A }” = i ■ A £ M d , if the radii of all clusters are bounded by a constant R 
independent of n and d, then, for any cluster Ci, 

|| K(Ci ,:) - K r (Ci, :)||f < ^r-^VlC^R 
where K r (Ci ,:) is the best rank-r approximation to K(Ci ,:). 

Similarly, we also have 

|| K(:,Ct) - K r (:,Ci )||f < ^r~ l,d s/\C^R- 


The proof of Theorem 3.1 follows a similar strategy of Section 6.1 in Si et al. [21]. The error bound 
suggests that given r and d, a small radius R is preferred to lower this bound. To reduce R, we use A--means / 
fc-center algorithm (see Appendix A) as the space partitioning method. They differ in the objective functions 
and thus give slightly different clusters. Empirically we find that neither of them is superior to the other for 
all datasets; /c-means tends to cluster points more evenly while /.•-center tends to cluster points into compact 
groups. 

Next, we state the basic idea of our algorithm. Suppose we are already given the partitions for data 
points, as well as the rank n used for cluster C L . We first permute the matrix according to the clusters, such 
that every block of the matrix represents the interaction between two clusters. The permuted matrix is in 
Equation 3. 


M = PKP t 



Cl 

C 2 

Ck 

Cl 

/Mi,i 

XI 1,2 

••• Ml, A 

C 2 

M‘2. 1 

AX 22 

XI 2, k 

Ck 

\m m 

XIk,2 

■ ■ ■ XI kk j 


( 3 ) 


where P is a permutation matrix, Mjj is the interaction matrix between cluster C % and cluster Cj. Then 
to get a BBF, we need to compute a set of k bases and use them to compute the inner matrices for all the 
blocks. 

1. Compute basis. The most accurate way for computing the column basis Ui for the row-submatrix Ap : is to 
apply an SVD on Mj >: . But this is too slow, so we restrict ourselves to subsampled indices, i.e., we sample 
from all the columns (denoted as II), and apply the randomized SVD in Halko et al. [15] (see Appendix B) 
on the subsampled matrix M,.rr to obtain U r . 


2. Compute inner matrices. For block Mij, a matrix that can best approximate M t j in the form of 
UiCijUf is given by {Ui)'M^j{Ujy, where (-)t denotes the pseudo-inverse. However, forming each 
block Mij will result in constructing the entire matrix, this 0{n 2 d) cost renders the method impractical for 
massive datasets. To reduce the cost, we again restrict ourselves to subsampled indices. We sample columns 
J and rows X from M t j, and compute in the form of (U t (I. :))tMjj(X, , :) T )t. We can 

further reduce the cost by skipping the computation for off-diagonal blocks with small entries. 

Our fast algorithm for kernel matrix approximation is presented in Algo. 1. 
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Algorithm 1 (Fast factorization for BBF) 

1: Input: the number of clusters k, clusters C ,, near-optimal rank r, for C, with i = 1 ,,k 
2: Output:: U and C 
3: for i = 1,..., k do 

4: Sample r, +1 columns (denoted II) from M* : using the randomized sampling algorithm in Appendix 

C 

5: Construct Mj n 

6: Apply a randomized SVD on M t j\ to compute U t 

7: end for 

8: for Ci,Cj {hj = 1, • ■ •, k) do 

9: if cutoff criterion is satisfied then 

10 : C i: j = 0 

it: else 

12: Sample r, rows T r , r ? columns T c uniformly at random, and let X = n r U F r , J = II C U r c , 

where 1 I r /II c are the important rows/columns computed in step 4. 

13: Cij = (Ui(X, :))t MijfrJXUjiJ, :) r )t 

14: end if 

15: end for 


3.2 BBF Construction 

3.2.1 Fast computation of basis 

In order to achieve linear complexity in computing the bases, we consider restricting ourselves to a subset of 
the dataset X. This is to say, when computing Ui € M”" 1 xr *, instead of working on the entire row-submatrix 
M, j: € M n,xn , we can work on a carefully selected set of sub-columns of M K -. These sub-columns will 
ideally preserve important information about the basis. 

The procedure is then to sample from columns and compress the sampled matrix. We first sample n+l(l 
is an over sampling parameter) columns II from M, j: using the randomized sampling algorithm in Engquist 
et al. [9] (see Appendix C). It samples r columns by alternating between important columns given rows and 
important rows given columns, and has O (r 2 (rn + n)) operations for a matrix of size m by n. After we have 
found the submatrix M t j], we can then apply the randomized SVD on M t \\ to get U r . With the sampling 
step, we avoid the construction of the entire matrix M. 

3.2.2 Fast computation of inner matrices 

For the interaction matrix Mjj between cluster C t and Cj, we seek a matrix C l)} that can best approximate 

M t j in the form of UjCijUj. Since for most machine learning applications, the Frobenius norm is often 

used, we chose to find a low-rank approximation UiCijUj that captures most of the Frobenius norm of 

Mi j. This is equivalent to solving the following optimization problem: 

min ||- U i CijUj'\\ F 

Ci,j 

The minimizer is given by C- h j = (JUi)', where f denotes the pseudo-inverse. 

Sample columns and rows. To avoid the quadratic cost in forming all blocks, we again restrict ourselves 
to subsampled indices. 

Proposition 3.2. If a matrix M € M mxn can be written as M = UCV T , where U € M mxri and V € 
M n x r ' 2 have full column rank, then we subsample l \ > r\ rows from M, denoted as I, I 2 > r '2 columns from 
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M, denoted as J. If both U (X,:) and V (J ,:) are full rank, then we have 

C = (F(J,:) T ) t 

The proof can be found in Appendix D. 

This suggests that to get Qj, instead of constructing Mij, we can sample columns and rows of M t j 
and work on a smaller sub-space. That is, we sample rows X and columns J from Mij, and then compute 
the inner matrix by the form stated in Proposition 3.2. In fact, there is a way to likely satisfy the full rank 
requirement of UiZ .:) and V( J,:) without extra cost. We can choose X = lf r U T r , where II r is the set of 
important rows computed when sampling columns for computing Ui, and T r is a set of uniformly sampled 
indices. The same method applies when choosing J. 

Skip computation for blocks with small entries. If an accuracy e is desired for the approximation, we can 
skip the computation for blocks whose entries are sufficiently small. For two clusters that are far apart, the 
entries can become very small for a range of kernel parameters, in which case the computation of the inner 
matrix can be skipped and both memory and time savings can be gained. 

Low-rank on inner matrices. When the rank of an inner matrix C , ) ? € M n ‘ xnj is smaller than half 
of min{nj,rij}, we can represent C- t j as a low-rank factorization. The improvement in memory will be 
significant when the number of clusters k is large. We can see this by noting that the storage of the BBF is 
^2i =1 riiri + A;) 2 , where the second term comes from inner matrices. When k is large, the second 

term is usually large. Thus if the memory of each inner matrix is reduced by half, we can reduce the second 
term by half. Empirically, we found that k is usually small for our test cases, therefore we did not see a 
significant gain by applying this step. Hence we will keep this as a theoretical interest but skip this step in 
our algorithm. 

3.3 Optimal Parameter Selection 

In this section we will provide a heuristic method to produce near-optimal parameters for BBF. We take n 
points X, and a desired accuracy e as input, and will compute the optimal number of clusters k, the index 
set for different clusters X, and the estimated rank n for the column basis of row i. Our goal is to find the 
parameters that can satisfy the desired accuracy while achieving the lowest memory cost. 

Compute near-optimal ranks. The diagonal block is the interaction of a cluster with itself, while the 
off-diagonal blocks are the interactions between clusters. It is natural to expect the rank of the off-diagonal 
blocks to be smaller than that of diagonal blocks. Therefore to achieve e accuracy for all blocks, we only need 

to achieve that for all the diagonal blocks. If for each block Mjj € M n ' ,;Xr 'A we have < ^pr-e 2 , 

then the reconstruction error will be bounded by e: 

\\M-Mf F Eij rurij 2= 2 

||M||2 - n 2 C r 

where n is the number of total data points. Therefore, the optimal rank r, is given by min{/;: | 'fffjLk+i 17 j < - 
e 2 }, where oq > c r 2 > ... > cr ni are singular values for 

Compute near-optimal number of clusters k. Our next goal is to look for k that achieves the lowest 
memory. We proceed by solving the following optimization problem: 

k k 

minimize f(k) = ^ rijr, + n ) 2 , 

i— 1 2—1 

Tli 

s.t. n = minjfc | ^ < T ^ L \\M\\ 2 F e 2 }, Vi 

j=k+ 1 
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where f(k) is the memory requirement of the BBF when using k clusters. We observe that /(/,:) is approx¬ 
imately convex in the domain [1 ,0(y/n)\, which enables us to use an O(logn) search algorithm for the 
mi ni miz er. 


3.4 Complexity Analysis 


To simplify the analysis and ensure that the factorization and application time are linear, we further require 
the numerical ranks of Mj >: (i = 1to be bounded by r max , where r max is a small constant independent 
of n. When we discuss the numerical rank of a matrix e M niXn J, we mean the smallest integer k such 


that 


min{nj ,rij } 


E 

j=k +1 


a] < 




where <j\ > ct 2 > • • • > a m i n { ni)rLj } > 0 are the singular values of M t j, and e stands for the desired 
accuracy. The complexity for each step can be found in Table 1. The application refers to performing a 
matrix-vector multiplication. Note that when computing the inner matrices, k can reach up to O(^JTi) while 
still maintaining a linear complexity for this step. 


Table 1: Complexity table, where k is the number of clusters, n r is the number of points in cluster C, , r* is the 
estimated rank for the basis of row i, n is the total number of points, r max is a small constant (independent 
of n). 


Parameter selection 


Factorization Application 

Each block 

3 THr^ 

Basis 

0(nkr^ ax ) 

Compute f(k) 

^(^Gnax) 

Inner matrix 

El, E‘_i *rf rj + 2r,r? ° (nr "“ ) 

Total 

0(r'L x n log n) 

Total 

°( n krl ax ) 


4 Experimental Results 

In this section, we evaluate the performance of our proposed algorithm on synthetic and real datasets 1 listed 
in Table 2. We use BBF to denote our algorithm, which includes constructing the BBF with parameters 
computed from Section 3.3, and applying the BBF to a vector. We benchmark our method against other 
state-of-art kernel approximation methods. All experiments were run on a computer with 2.4 GHz CPU and 
8 GB memory. 


Table 2: Real datasets used in our experiments, where n is the number of data points, and d is the dimension 
for each point. 


Dataset 

n 

d 

Dataset 

n 

d 

Abalone 

4,177 

8 

Wine quality 

4,898 

11 

Pageblock 

5,473 

11 

Census house 

22,748 

16 

Pendigits 

10,992 

16 

Forest covertype 

581,012 

54 


’All the datasets are downloaded from UCI repository [3], 
















4.1 Quality of BBF 


We first investigate how well the factorization can approximate the exact kernel matrix as we vary memory 
cost and kernel parameters, and how consistently small the error is. The accuracy is computed via the 
relative error ' , where K is the approximated kernel matrix, I\ is the exact kernel matrix, and 

|| • \\p stands for the Frobenius norm. The kernel functions used in the experiments are the Gaussian kernel 
!C(x, y) = exp(— ||x — y\\ 2 /h 2 ) and the Laplacian kernel /C(x, y) = exp(— ||x — y\\i/h), where x and y are 
data points, and h is the kernel parameter. We now list the kernel approximation methods we are comparing 
against; details about parameter selection for each method are given in Appendix E: 

- The standard Nystrom (Nys) [7] 

- K-means Nystrom (kNys) [23] 

- Leverage score Nystrom (IsNys) [8] 

-Memory Efficient Kernel Approximation (MEKA) [21] 

- Random Kitchen Sinks (RKS) [18] 

In our experiments, we normalized the data so that each dimension (feature) of data has zero mean and 
standard deviation one. 

4.1.1 Quality of BBF for various memory costs 

We will investigate how our method behaves as more memory is available. Note that the memory cost is a 
close approximation of the running time for a matrix-vector multiplication. In addition, computing memory 
is more accurate than running time, which is sensitive to the implementation and algorithmic details. In 
our experiments, we increase the memory cost by requiring a higher accuracy in BBF, and compute the 
input parameters accordingly for other methods to ensure all methods have roughly the same memory (the 
memory for a ra nk -r low-rank approximation is computed as nr). For each memory cost, the experiment 
is carried out 20 times to demonstrate the stability of our algorithm with respect to the random sampling 
errors. The results are shown in Figure 2. 

As can be seen from Figure 2, the accuracy of all methods generally improves as memory increases. 
When memory is low, BBF is comparable with alternative methods; as more memory is available, BBF 
exhibits a significantly higher accuracy than the other methods. We also observe from the multiple trials that 
our method has a smaller standard deviation in error. 

4.1.2 Quality of BBF for various kernel parameters 

In this section, we will show that when the kernel matrix is not well approximated by low-rank methods, 
BBF still can approximate the matrix well. We will illustrate this point on two datasets, coupled with similar 
observations from Gittens and Mahoney [10]. In the experiments, we fix the memory for all methods and all 
kernel parameters to be roughly the same, and present the average accuracy vs -jA in Figure 3. As we can see 
from Figure 3, when X gets larger, all low-rank methods (including exact SVD) give worse relative error. 
The results are explained by the statistics in Table 3. This table shows that the property of kernel matrix 
changes with the kernel parameter: as the kernel parameter h gets smaller, the kernel matrix becomes less 
“nice” in terms of rank, that is the kernel matrix goes from low-rank to diagonally dominant. This can be 
seen from both the stable rank \ ',. T ], an underestimate of the rank of matrix M, and ., r 'T , the portion 
of the Frobenius norm that is preserved in M r . Therefore, we expect the behavior for all methods (for a 
given memory) to be worse as the rank of the kernel matrix becomes higher. This is what we see in Figure 3. 
However, BBF keeps a low error regardless of changes in the matrix property. This is because, as the matrix 
becomes more diagonal, we can increase the number of clusters k to better approximate the diagonal part, 
and the off-diagonal blocks can be sparsificd due to their small entries. With the flexibility of varying the 
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relative error 





Figure 2: Comparisons (loglog scale) of BBF (our algorithm) with different state-of-art kernel approxima¬ 
tion methods on various real datasets. All the plots have the same legend. For each method, we show the 
improvement in accuracy as the memory cost is increased. For each memory cost, we present the results of 
20 runs of each algorithm. Each run is shown with a symbol, while the lines represent the average accuracy. 
h is the kernel parameter. The exact memories are 1.7e7, 1.2e8, 2.4e7 and 3.0e7 for subplots from left to 
right, respectively. For subplots 2.1, 2.2 and 2.3, the Gaussian kernel exp(— ||x — y 111/ Ir) is used; for sub¬ 
plot 2.4, the Laplacian kernel exp(— \\x — y|| i /h) is used. The exact leverage score is used for the leverage 
score Nystrom. The input number of clusters k for MEKA was set to 5. Other values of k for MEKA were 
also tested, but they did not give better results. The results for MEKA (orange crosses) spread out from the 
mean (orange dotted line), indicating a large standard deviation in the error. Note that the speed-up depends 
on the choice of h. A smaller value of h increases the speed-up of BBF even further. 
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number of clusters, BBF can efficiently use the memory and is favorable in all cases, from low-rank to 
diagonal. Combined with the results from Section 4.1.1 where the rank of kernel matrix is low, we have 
demonstrated that our proposed scheme works for a large range of kernel parameters. 




Figure 3: Plots for relative error vs -jX for different kernel approximation methods. For the two BBF methods 
(our algorithm), the dark blue one fixed the number of clusters k to be 20, while for the one denoted by the 
light blue line the k was allowed to vary. The memory costs for all methods and all kernel parameters 
are fixed to be roughly the same. We use ^ to describe the memory footprint, where r is the rank used 
for low-rank methods, n is the number of data points. The kernel function we use is the Gaussian kernel: 
exp(—1|x — y Hi//? 2 ). On the x-axis, A- varies from small to large. As the rank of the matrix increases, the 
gap (in terms of accuracy) between low-rank approximation methods and BBF becomes larger. 


Table 3: Summary statistics for abalone andpendigits datasets with the Gaussian kernel exp(— ||x— yW^/h 2 ), 
where r is the rank, A r+ i and X r arc the r+l-th and r-th largest eigenvalues, M r is the best rank-r approxi¬ 
mation for M, and M is the exact matrix, l r is the r-th largest leverage score scaled by 


Dataset 

r 

t 

7? 


Ar+1 

i nnl 

M r 

If 

l r 

ww 

X r 

\M\ 

F 



0.25 

2 

0.9712 

99.9991 

4.34 



1 

4 

0.9939 

99.8688 

2.03 



4 

5 

0.9898 

97.3333 

1.94 

Abalone 

100 

25 

15 

0.9979 

72.0058 

5.20 



100 

175 

0.9988 

33.4007 

12.60 



400 

931 

0.9991 

19.4731 

20.66 



1000 

1155 

1.0000 

16.5210 

20.88 



0.1 

3 

0.9987 

99.9937 

2.39 



0.25 

6 

0.9956 

99.7912 

1.83 



0.44 

8 

0.9920 

98.9864 

1.72 

Pendigits 

252 

1 

12 

0.9945 

93.6439 

2.02 



2 

33 

0.9965 

77.6317 

2.90 



4 

207 

0.9989 

49.6018 

4.86 



25 

2794 

0.9998 

19.8506 

14.78 
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4.2 Complexity of BBF 


We first evaluate the growth of running time on synthetic datasets, with a comparison against the IFGT; then 
we examine the complexity on two real datasets. The kernel function used in this section is the Gaussian 
kernel. 

The IFGT has a linear complexity for both time and memory with a fixed dimension d. However, when d 
increases (d > 10), the number of data points n need to be large enough for the algorithm to exhibit a linear 
growth. BBF, however, does better than the IFGT in the sense that when d gets large, it does not require n to 
be very large to show this linear behavior. To demonstrate this, we investigate the performance of BBF and 
the IFGT on a synthetic dataset with different dimensions. Figure 4 shows that for a desired tolerance 10 -3 , 
when d = 5, the IFGT behaves as 0(n), but when d = 40, the behavior of the IFGT is quadratic. However, 
we can see that for all cases BBF grows linearly, and produced errors in range [(9(10 4 ). 09(10 3 )]. Note 
that the IFGT is in C++ while our algorithm is in MATLAB. 




N 


N 


4.1 Application time (d = 5) 


4.2 Application time (d = 40) 




4.3 Total time (d = 5) 


4.4 Total time (d = 40) 


Figure 4: Timing for IFGT and BBF (loglog scale) on a synthetic dataset with dimension d = 5 and 40, 
where we generated 10 centers uniformly at random in a unit cube, and randomly generated data around the 
centers with standard deviation 0.1. The desired accuracy given for both algorithms is 10 3 , and the kernel 
parameter h was set to 0.5. All plots have the same legends. The plots at top arc for application (matrix- 
vector product) time, and plots at bottom are for total time (factorization plus application). The timing for 
BBF is linear for all cases, while the timing for the IFGT behaves differently as d changes. 


Next, we will examine the linear complexity on real datasets. The results are shown in Figure 5. These 
results are consistent with those shown on the synthetic datasets and conclusively demonstrate that our 
algorithm has a linear complexity. 
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number of sampled data point 

5.1 Census housing, h = 2 



number of sampled data point 

5.2 Forest covertype, h = 2 


Figure 5: Factorization time (loglog scale) for data from two real datasets. The left is census housing and 
the right is forest covertype. To illustrate the linear growth of the complexity, we generated datasets with a 
varying number of points with the following strategy. We first clustered census housing and forest covertype 
into groups Cf,, Cf. (k = 15), and sampled a portion p% from every group, then increased p. To avoid 
the other factors affect the timing, we fixed the input rank for each clusters. As we can see, the timing for 
both datasets grows linearly. 


5 Conclusions 

In this paper, we proposed a linear complexity algorithm that works for a large range of kernel parameters, 
and that achieves comparable and often orders of magnitude higher accuracy when comparing with other 
state-of-art kernel approximation methods, with the same memory cost. Although our algorithm makes use 
of randomized algorithms, the error in the method is very consistent and has a small standard deviation. This 
is in contrast with other randomized methods for which the error fluctuates much more significantly. 
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Appendix 

Appendix A Clustering algorithms 


For a given set of data points, we need an efficient algorithm to partition the data points. Given n data points, 
and the number of clusters k, the fc-center and /c-means problems partition the points into k clusters. The 
difference between fc-center and /e-means is the objective function: fc-center minimizes max,; rnax„ G c’ ): ||v — 
q11 2 , while /e-means minimizes YlveC ll u ~ °i\\ 2 , where c, is the z-th cluster and C, is the center/centroid 
of C,. Both of them are NP-hard problems, but there are efficient approximation algorithms that converge 
quickly to a local optimum. Farthest-point clustering [12] is a 2-approximation algorithm for /e-ccntcr 
problem with a cost of 0(nkd ) for each iteration. Standard K-means algorithm alternates between best 
assignments given centroids, and best centroids given assignments; it also has a cost of 0{nkd) for each 
iteration. 


Appendix B Randomized SVD 

To obtain a BBF, one step is to compute the column basis Ui of the row sub-matrix Mj >: (i = 1,..., k). This 
can be achieved via an exact SVD, but the cost is 0{mn 2 ) for a matrix of size m by n (m > n ). Flere, 
we use the randomized SVD introduced in [15]. It reduces the cost of computing a rank-r approximation of 
a matrix of size m by n to 0(mn(r + /)), where l is an oversampling parameter. We briefly describe the 
algorithm for obtaining a rank-r approximation UT,V* of a matrix M £ W nxn . 

Notations. Let r denote the desired rank, l denote the oversampling parameter, and q denote the iteration 
parameter. 

1. Preliminaries. We randomly generate a Gaussian matrix Q £ M nx ( r+i ). 

2. Compute Basis. Apply a QR decomposition on Mil to get Q. Repeat the following steps for q times: 
do a QR decomposition of M*Q to get an updated Q, apply QR again on MQ to get Q. 

3. Construct SVD. Apply an SVD on an r by n matrix B = Q*M and get B = UW*. Then, the SVD 
of M is given by: U = QU, E = t, V = V. 

Appendix C Randomized sampling method with linear cost 

We can further reduce the cost for constructing a basis by working on a sub-sampled space. Hence, we 
need to seek an efficient and effective method of sampling columns from a matrix M £ M. mxn . We use a 
sampling algorithm [9] with complexity 0(r 2 (m + n)), where r is the number of samples. We describe the 
algorithm below: 

Notations. Let M £ E mxn be the matrix we sample columns from, let r denote the desired number of 
columns, l denote the oversampling parameter. 

1. Get important columns. Uniformly sample r rows from M, and denote the index set as F r . Apply a 
pivoted QR factorization on Mr r , : to get the top r important column indices, denoted as n c . 

2. Get important rows. Uniformly sample l columns, and denote the index set as T c . Update n c = 
r c U n c . Apply a pivoted LQ factorization on M- ; n c to get the top r + l important row index set, 
denoted as n r . 

3. Update important columns. Uniformly sample l rows, and denote the index set as r r , update n r = 
II r U r r . Apply a pivoted QR factorization on A/n, . : to get the top r important columns. 
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Appendix D Proof of Theorem 3.2 


Proof. For the simplification of notation, let us denote U = U(X .:), V = V{J .:), and M = M (X.J). 
Directly from M = UCV T , we get M = UCV T . By multiplying W and (V T 'y on both sides, we have 
U^MV^ = U^UCV T (V T )^. Since U is a tall and skinny full rank matrix, we have U^U = I, the same 
goes for V T (V T )^. Therefore, M(V T )^ = C. □ 

Appendix E Parameters and implementations used for other kernel approx¬ 
imation methods 

The standard Nystrom. We uniformly sample 2k columns without replacement for a rank k approxima¬ 
tion. 

K-means Nystrom. We use the code provided by the author. 

Leverage score Nystrom. We compute exact leverage scores ( 0(n 2 )) and sample 2k columns with replace¬ 
ment for a rank k approximation. 

Memory Efficient Kernel Approximation. We use the code provided by the author. 

Improved Fast Gauss Transform. We use the code provided by the author. The code is implemented in 
C++ and we used its MATLAB interface. 
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